feat(ingestion): Make PySpark optional for S3, ABS, and Unity Catalog sources #15123
base: master
Conversation
… sources

PySpark and PyDeequ have been required dependencies for the S3, ABS, and Unity Catalog sources, even when profiling is disabled. This creates unnecessary installation overhead (~500MB) and potential dependency conflicts for users who don't need profiling capabilities.

**PySpark Detection Framework**
- Added `pyspark_utils.py` with centralized availability detection (see the sketch below)
- Graceful fallback when PySpark/PyDeequ are unavailable
- Clear error messages guiding users to install dependencies when needed

**Modular Installation Options**
- S3/ABS/GCS sources now work without PySpark when profiling is disabled
- New `data-lake-profiling` extra for modular PySpark installation
- Convenience extras: `s3-profiling`, `gcs-profiling`, `abs-profiling`
- Unity Catalog gracefully falls back to sqlglot when PySpark is unavailable

**Config Validation**
- Added validators to the S3/ABS configs to check PySpark availability at config time
- Validates profiling dependencies before attempting to use them
- Provides actionable error messages with installation instructions

**Installation Examples**
```bash
pip install 'acryl-datahub[s3]'                      # S3 without profiling
pip install 'acryl-datahub[s3,data-lake-profiling]'  # S3 plus profiling dependencies
pip install 'acryl-datahub[s3-profiling]'            # convenience extra for the same
```

**Dependencies**
- PySpark ~=3.5.6 (in the `data-lake-profiling` extra)
- PyDeequ >=1.1.0 (data quality validation)

**Benefits**
- Reduced footprint: base installs are ~500MB smaller without PySpark
- Faster installs: no large PySpark download for non-profiling users
- Better errors: clear messages when profiling needs PySpark
- Flexibility: users choose their level of profiling support
- Backward compatible: existing installations continue working

**Testing**
- Added 46+ unit tests validating optional PySpark functionality
- Tests cover availability detection, config validation, and graceful fallbacks
- All existing tests continue to pass

See docs/PYSPARK.md for a detailed installation and usage guide.
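For illustration, here is a minimal sketch of the kind of centralized availability check `pyspark_utils.py` provides. The module name comes from this PR; the exact contents below are assumptions, not the real implementation:

```python
# Illustrative sketch of a centralized PySpark availability check in the
# spirit of pyspark_utils.py; the actual module in the PR may differ.
try:
    from pyspark.sql import DataFrame, SparkSession

    PYSPARK_AVAILABLE = True
except ImportError:
    # Fall back to None so importing modules still load without PySpark.
    DataFrame = None  # type: ignore[assignment, misc]
    SparkSession = None  # type: ignore[assignment, misc]
    PYSPARK_AVAILABLE = False


def require_pyspark(feature: str) -> None:
    """Fail with an actionable message when a feature needs PySpark."""
    if not PYSPARK_AVAILABLE:
        raise ImportError(
            f"{feature} requires PySpark/PyDeequ. Install them with: "
            "pip install 'acryl-datahub[data-lake-profiling]'"
        )
```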
Codecov Report: ❌ The patch check failed because the patch coverage (59.77%) is below the target coverage (75.00%). You can increase the patch coverage or adjust the target coverage.
Flips the implementation to maintain backward compatibility while providing lightweight installation options. The S3, GCS, and ABS sources now include PySpark by default, with new -slim variants for PySpark-less installations.

**Changes:**

1. **Setup.py - Default PySpark inclusion:**
   - `s3`, `gcs`, `abs` extras now include `data-lake-profiling` by default
   - New `s3-slim`, `gcs-slim`, `abs-slim` extras without PySpark (see the sketch below)
   - Ensures existing users see no breaking changes
   - Naming aligns with Docker image conventions (slim/full)
2. **Config validation removed:**
   - Removed PySpark dependency validation from the S3/ABS configs
   - Profiling failures now occur at runtime (not config time)
   - Maintains pre-PR behavior for backward compatibility
3. **Documentation updated:**
   - Updated PYSPARK.md to reflect the new installation approach
   - Standard installation: `pip install 'acryl-datahub[s3]'` (with PySpark)
   - Lightweight installation: `pip install 'acryl-datahub[s3-slim]'` (no PySpark)
   - Added a migration path note for future DataHub 2.0
   - Explained the benefits of the -slim variants for DataHub Actions
4. **Tests updated:**
   - Removed tests expecting validation failures without PySpark
   - Added tests confirming the config accepts profiling without validation
   - All tests pass with the new behavior

**Rationale:**

This approach provides:
- **Backward compatibility**: existing users see no changes
- **Migration path**: users can opt into -slim variants now
- **Future flexibility**: DataHub 2.0 can flip the defaults to -slim
- **No breaking changes**: maintains pre-PR functionality
- **Naming consistency**: aligns with the Docker slim/full convention

**Installation examples:**

```bash
pip install 'acryl-datahub[s3]'        # with PySpark (default)
pip install 'acryl-datahub[gcs]'
pip install 'acryl-datahub[abs]'
pip install 'acryl-datahub[s3-slim]'   # without PySpark
pip install 'acryl-datahub[gcs-slim]'
pip install 'acryl-datahub[abs-slim]'
```
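For illustration, the flipped extras layering from item 1 could be expressed roughly like this in setup.py. Only the extra names and the slim/full split come from this PR; the concrete dependency sets (`boto3`, `smart-open`) are placeholders:

```python
# Hypothetical sketch of the flipped extras layout; dependency contents
# are placeholders, only the extra names mirror the PR description.
data_lake_profiling = {
    "pyspark~=3.5.6",
    "pydeequ>=1.1.0",
}

s3_slim = {
    "boto3",           # placeholder: core S3 access without PySpark
    "smart-open[s3]",  # placeholder
}

extras_require = {
    "data-lake-profiling": sorted(data_lake_profiling),
    # Lightweight variant: no PySpark, profiling must stay disabled.
    "s3-slim": sorted(s3_slim),
    # Default keeps PySpark so existing installations are unaffected.
    "s3": sorted(s3_slim | data_lake_profiling),
}
```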
…yment
Introduces slim and locked Docker image variants for both
datahub-ingestion and datahub-actions, supporting environments with
different PySpark requirements and security constraints.
**Image Variants**:
1. **Full (default)**: With PySpark, network enabled
- Includes PySpark for data profiling
- Can install packages from PyPI at runtime
- Backward compatible with existing deployments
2. **Slim**: Without PySpark, network enabled
- Excludes PySpark (~500MB smaller)
- Uses s3-slim, gcs-slim, abs-slim for data lake sources
- Can still install packages from PyPI if needed
3. **Locked** (NEW): Without PySpark, network BLOCKED
- Excludes PySpark
- Blocks ALL network access to PyPI/UV indexes
- datahub-actions: ONLY bundled venvs, no main ingestion install
- Most secure/restrictive variant for production
**Additional Changes**:
**1. pyspark_utils.py**: Fixed module-level exports
- Added SparkSession, DataFrame, AnalysisRunBuilder, PandasDataFrame as None
- These can now be imported even when PySpark is unavailable
- Prevents ImportError in s3-slim installations
**2. setup.py**: Moved cachetools to s3_base
- operation_config.py uses cachetools unconditionally
- Now available in s3-slim without requiring data_lake_profiling
**3. build_bundled_venvs_unified.py**: Added slim_mode support (see the sketch after this list)
- BUNDLED_VENV_SLIM_MODE flag controls package extras
- When true: installs s3-slim, gcs-slim, abs-slim (no PySpark)
- When false: installs s3, gcs, abs (with PySpark)
- Venv named {plugin}-bundled (e.g., s3-bundled) for executor compatibility
**4. datahub-actions/Dockerfile**: Three variant structure
- bundled-venvs-full: s3 with PySpark
- bundled-venvs-slim: s3-slim without PySpark
- bundled-venvs-locked: s3-slim without PySpark
- final-full: Has PySpark, network enabled, full install
- final-slim: No PySpark, network enabled, slim install
- final-locked: No PySpark, network BLOCKED, NO main install (bundled venvs only)
**5. datahub-ingestion/Dockerfile**: Added locked stage
- install-full: All sources with PySpark
- install-slim: Selected sources with s3-slim (no PySpark)
- install-locked: Minimal sources with s3-slim, network BLOCKED
**6. build.gradle**: Updated variants and defaults
- defaultVariant: "full" (restored to original)
- Variants: full (no suffix), slim (-slim), locked (-locked)
- Build args properly set for all variants
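For illustration, a rough sketch of the slim-mode switch from item 3. Only the `BUNDLED_VENV_SLIM_MODE` flag, the `-slim` extras, and the `{plugin}-bundled` venv naming come from this PR; the helper names and the plain pip invocation are assumptions:

```python
# Hypothetical sketch of slim_mode handling in a bundled-venv build script;
# the real build_bundled_venvs_unified.py logic may differ.
import os
import subprocess

SLIM_MODE = os.environ.get("BUNDLED_VENV_SLIM_MODE", "false").lower() == "true"


def extra_for(plugin: str) -> str:
    """Pick the PySpark-less extra when slim mode is enabled."""
    return f"{plugin}-slim" if SLIM_MODE else plugin


def install_bundled_venv(plugin: str, venvs_root: str) -> None:
    # The venv is always named {plugin}-bundled (e.g. s3-bundled) so the
    # executor can locate it regardless of which extra was installed.
    venv_pip = os.path.join(venvs_root, f"{plugin}-bundled", "bin", "pip")
    subprocess.run(
        [venv_pip, "install", f"acryl-datahub[{extra_for(plugin)}]"],
        check=True,
    )
```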
**Network Blocking in Locked Variant**:
```dockerfile
ENV UV_INDEX_URL=http://127.0.0.1:1/simple
ENV PIP_INDEX_URL=http://127.0.0.1:1/simple
```
This prevents all PyPI downloads while still allowing packages cached during the build.
**Bundled Venv Naming**:
- Venv named `s3-bundled` (not `s3-slim-bundled`)
- Recipe uses `type: s3` (standard plugin name)
- Executor finds `s3-bundled` venv automatically
- Slim/locked: venv uses s3-slim package internally (no PySpark)
- Full: venv uses s3 package (with PySpark)
**Testing**:
✅ Full variant: PySpark installed, network enabled
✅ Slim variant: PySpark NOT installed, network enabled, s3-bundled venv works
✅ Integration tests: 12 tests validate s3-slim functionality
**Build Commands**:
```bash
# Full variants (default: PySpark included, network enabled)
./gradlew :datahub-actions:docker
./gradlew :docker:datahub-ingestion:docker

# Slim variants (no PySpark, network enabled)
./gradlew :datahub-actions:docker -PdockerTarget=slim
./gradlew :docker:datahub-ingestion:docker -PdockerTarget=slim

# Locked variants (no PySpark, network blocked)
./gradlew :datahub-actions:docker -PdockerTarget=locked
./gradlew :docker:datahub-ingestion:docker -PdockerTarget=locked

# Matrix build (all variants)
./gradlew :datahub-actions:docker -PmatrixBuild=true
./gradlew :docker:datahub-ingestion:docker -PmatrixBuild=true
```
**Recipe Format** (works with all variants):
```yaml
source:
  type: s3  # uses the existing "s3" source type
  config:
    path_specs:
      - include: "s3://bucket/*.csv"
    profiling:
      enabled: false  # Required for slim/locked
```
Bundle Report: changes will increase total bundle size by 9.26 kB (0.03%) ⬆️, which is within the configured threshold ✅ (affected bundle: datahub-react-web-esm).